Information maximization has been investigated as a possible mechanism of learning governing
the self-organization that occurs within the neural systems of animals. Within the general context
of models of neural systems bidirectionally interacting with environments, however, the role of
information maximization remains to be elucidated. For bidirectionally interacting physical systems,
universal laws describing the fluctuation they exhibit and the information they possess have recently
been discovered. These laws are termed fluctuation theorems. In the present study, we formulate
a theory of learning in neural networks bidirectionally interacting with environments based on the
principle of information maximization. Our formulation begins with the introduction of a generalized
fluctuation theorem, employing an interpretation appropriate for the present application, which
differs from the original thermodynamic interpretation. We analytically and numerically demonstrate
that the learning mechanism presented in our theory allows neural networks to efficiently
explore their environments and optimally encode information about them.

PACS numbers: 05.40.-a, 84.35.+i, 87.19.lo, 89.70.-a


Introduction : The neural systems of animals are prominent as highly efficient systems for
processing information concerning the external environment. Many authors have argued that the
learning capability of neural systems is information-theoretically optimal by showing that several
features of neural activity can be accounted for by positing information maximization (Infomax) in
the learning of sensory signals and intrinsic dynamics in neural circuits [?]. However, Infomax has
not yet been investigated in a general context. In particular, although a real neural system
interacts bidirectionally with its environment, not only receiving sensory signals and organizing
intrinsic dynamics accordingly, but also generating motor outputs that influence the environment,
Infomax learning has not been clearly formulated in this context (see Fig. 1(a)). For example, the
formulation of Infomax must be generalized in order to facilitate its application to the following
type of model. One of the standard models of the interaction between neural systems and
environments employs the Markov decision process. In this model, we consider discrete-time
($t \in \mathbb{Z}$) stationary Markovian dynamics of a stochastic neural network with $N$
binary-valued neurons $x^t \in \{0,1\}^N$ interacting with an environment $y^t$ that takes values in
a discrete state space $\mathcal{Y}$. The neural elements $\{x^t_i\}$ ($1 \le i \le N$) receive
inputs from the environment in such a manner that realizes values stochastically according to a
conditional probability $\pi(x^t|y^t)$. This conditional probability depends on the model
parameters, such as the synaptic strength, and it changes slowly during the learning process through
the adjustment of the model parameters. Then, the state of the environment at the next timestep,
$y^{t+1}$, is obtained stochastically with a transition probability $\mu(y^{t+1}|y^t,x^t)$ that is
determined by the current states of the neural network and environment.

FIG. 1. (Color) (a) Interactions between a neural system and its environment. (b) Representation of
the dynamics as a causal network.


For interacting physical systems, there are recently discovered universal laws called fluctuation
theorems [?] that relate nonequilibrium physical quantities to informational quantities such as
mutual information. In particular, the dynamics of the neural network and the environment mentioned
above are represented in the form of a causal network with regard to which a generalized version of
the fluctuation theorem has been investigated [?] (Fig. 1(b)). Because fluctuation theorems describe
informational quantities for interacting systems, it is natural to hypothesize that they may provide
a description of a key aspect of the learning behavior exhibited by neural systems. Although
information thermodynamic considerations have been investigated in the context of learning systems
in a few pioneering studies [?], it has not been determined whether informational quantities are
actually maximized in such systems in some systematic way. In this Letter, we study this question.


We derive a novel type of Infomax learning, starting from the version of the integral fluctuation
theorem presented in [?], which provides the following inequality relating the average entropy
production $E[\sigma]$ of the neural network and the transfer entropy $I_{x \to y}$ from the neural
network to its environment:

$$ I_{x \to y} \ge -E[\sigma], \qquad \sigma = \log \frac{\pi(x^{t+1}|y^{t+1})}{\cdots}. \tag{1} $$

Throughout this article, the expectation value $E$ is taken with respect to the stationary
distribution $p_s$ of the dynamics unless otherwise noted. The transfer entropy is defined as the
conditional mutual information [?]:

$$ I_{x \to y} = I[y^{t+1}; x^t | y^t]. \tag{2} $$

As in information theory [?], the (conditional) mutual information between two variables is defined
as the change in the (conditional) entropy of one of the two variables owing to the inclusion of the
other variable as a conditioning variable. Explicitly, we have
$I[y^{t+1}; x^t | y^t] = H[y^{t+1}|y^t] - H[y^{t+1}|x^t,y^t]$. The (conditional) entropy is defined in
terms of the stationary distribution as $H[y^{t+1}|y^t] = -E[\log p_s(y^{t+1}|y^t)]$. The quantity
$I_{x \to y}$ represents the amount of information that the neural system possesses about the future
state of the environment. Thus, from the point of view of Infomax, it is a reasonable hypothesis
that maximizing this quantity is an effective learning mechanism. However, it is necessary for the
calculation of $I_{x \to y}$ to directly estimate the transition probability of the environment,
$\mu$, and this estimation apparently cannot be carried out by the neural network itself. Its lower
bound, $-E[\sigma]$, on the other hand, can be computed within the neural network, because $\sigma$
is determined by the transition probability of the neural network, $\pi$, alone. With these in mind,
it is natural to conjecture that neural systems attempt to optimize their acquisition of information
about the future by adjusting $\pi$ in such a manner as to maximize the quantity $-E[\sigma]$.
However, note that the equality in the relation $I_{x \to y} \ge -E[\sigma]$ is not generally
realized, and thus the maximization of $-E[\sigma]$ does not necessarily imply the maximization of
$I_{x \to y}$. Next, we show how consideration of a generalized entropy production allows us to
overcome this problem.
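To make the definition of the transfer entropy concrete, the following minimal sketch evaluates
$I[y^{t+1}; x^t | y^t]$ for a toy model with three environmental states and a single binary neuron.
All numerical values, and the names `PI`, `MU`, and `transfer_entropy`, are illustrative
assumptions and are not taken from the model studied in this Letter. The stationary distribution is
obtained by simple power iteration on the marginal chain of the environment.

```python
import math

# Toy model (all numbers are illustrative assumptions):
# environment states y in {0, 1, 2}, a single binary neuron x in {0, 1}.
PI = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]   # PI[y][x] = pi(x | y)
MU = [                                       # MU[y][x][y2] = mu(y2 | y, x)
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
    [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]],
    [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],
]


def marginal_transition(pi, mu):
    """a(y2 | y) = sum_x mu(y2 | y, x) * pi(x | y)."""
    n = len(pi)
    return [[sum(pi[y][x] * mu[y][x][y2] for x in range(len(pi[y])))
             for y2 in range(n)] for y in range(n)]


def stationary(a, iters=5000):
    """Stationary distribution of the environment chain by power iteration."""
    n = len(a)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(p[y] * a[y][y2] for y in range(n)) for y2 in range(n)]
    return p


def transfer_entropy(pi, mu):
    """I[y^{t+1}; x^t | y^t] = H[y^{t+1}|y^t] - H[y^{t+1}|x^t, y^t], in nats."""
    a = marginal_transition(pi, mu)
    ps = stationary(a)
    te = 0.0
    for y in range(len(pi)):
        for x in range(len(pi[y])):
            for y2 in range(len(pi)):
                p = ps[y] * pi[y][x] * mu[y][x][y2]
                if p > 0.0:
                    te += p * math.log(mu[y][x][y2] / a[y][y2])
    return te


te = transfer_entropy(PI, MU)

# If the environment ignores the neuron, the transfer entropy must vanish.
MU_BLIND = [[row[0], row[0]] for row in MU]
te_blind = transfer_entropy(PI, MU_BLIND)
print(te, te_blind)
```

Note that the function needs the environmental transition probability explicitly, which is exactly
the quantity that the neural network itself cannot access.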


Generalized Fluctuation Theorem : We prove the following inequality below:

$$ I_{x \to y} \ge -E[\Theta^t_\nu]. \tag{3} $$

Here, we define the following generalized forms of the entropy production in terms of a conditional
distribution $\nu$:

$$ \Theta^t_\nu = \log \frac{\pi(x^{t+1}|y^{t+1})}{\nu(x^t|x^{t+1},y^t)}, \qquad
   \tilde{\Theta}^t_\nu = \Theta^t_\nu + \log \frac{p_s(x^t|y^t)}{p_s(x^{t+1}|y^{t+1})}. \tag{4} $$

We can regard $\nu$ as representing physical quantities computed in the neural system on the basis
of $x^t$, $y^t$, and $x^{t+1}$ and adjusted through learning (see supplemental materials). First, we
have the apparent identity

$$ \exp\left[\log p_s(x^t|y^t) + \log \pi(x^{t+1}|y^{t+1}) - \tilde{\Theta}^t_\nu\right]
   = \exp\left[\log p_s(x^{t+1}|y^{t+1}) + \log \nu(x^t|x^{t+1},y^t)\right]. \tag{5} $$

Multiplying both sides by $p_s(y^{t+1},y^t)$ and summing them over relevant random variables, we
obtain a generalized form of the fluctuation theorem:

$$ E\left[\exp\left(-\tilde{\Theta}^t_\nu - J^t_{x \to y}\right)\right] = 1, \qquad
   J^t_{x \to y} = \log \frac{p_s(y^{t+1}|x^t,y^t)}{p_s(y^{t+1}|y^t)}. \tag{6} $$

Applying Jensen's inequality ($\exp E[F(Z)] \le E[\exp F(Z)]$, which applies to any random variable
$Z$ and any function $F$) to Eq. (6) gives

$$ E\left[-\tilde{\Theta}^t_\nu - J^t_{x \to y}\right] \le 0. \tag{7} $$

Noting that $E[\tilde{\Theta}^t_\nu] = E[\Theta^t_\nu]$ and $I_{x \to y} = E[J^t_{x \to y}]$, we have
the inequality in Eq. (3). It is found that, for fixed $\pi$, the right-hand side of Eq. (3) is
maximal if and only if

$$ \nu(x^t|x^{t+1},y^t) = p_s(x^t|x^{t+1},y^t). \tag{8} $$

Furthermore, we can prove that the equality in Eq. (3) holds if and only if, in addition to Eq. (8),
the mutual information takes the maximal value for the fixed $\pi$ and hence satisfies
$I[x^t;y^t] = H[y^t]$, under suitable conditions (see supplemental materials). If the neural network
has sufficient capacity, it is expected that there is some optimal $\pi$ that maximizes both
$I_{x \to y}$ and $I[x^t;y^t]$ simultaneously. In this case, the above analysis implies that the
optimal $\pi$ is obtained by maximizing $-E[\Theta^t_\nu]$ with respect to $\pi$ and $\nu$. In
conclusion, we find that, for a neural network with a large capacity, the maximization of
$-E[\Theta^t_\nu]$ leads to the maximization of $I_{x \to y}$ and $I[x^t;y^t]$. Because the
maximization of $I[x^t;y^t]$ served as the definition of Infomax in previous studies [?], the
maximization of $-E[\Theta^t_\nu]$ provides a generalized Infomax.
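The chain of relations in this section can be checked numerically on a small example. The sketch
below is a toy verification under stated assumptions: the model sizes, the random tables `pi`, `mu`,
and `nu_rand`, and all function names are hypothetical, and the transition probabilities are chosen
strictly positive so that every logarithm is finite. It evaluates the integral fluctuation relation,
the lower bound $-E[\Theta^t_\nu]$, and the transfer entropy, both for an arbitrary $\nu$ and for the
choice $\nu = p_s(x^t|x^{t+1},y^t)$.

```python
import math
import random

random.seed(0)
NY, NX = 3, 2  # toy sizes (assumption)


def rand_dist(n):
    """Random strictly positive probability vector of length n."""
    w = [random.random() + 0.1 for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]


pi = [rand_dist(NX) for _ in range(NY)]                       # pi[y][x]
mu = [[rand_dist(NY) for _ in range(NX)] for _ in range(NY)]  # mu[y][x][y2]
a = [[sum(pi[y][x] * mu[y][x][y2] for x in range(NX)) for y2 in range(NY)]
     for y in range(NY)]
ps = [1.0 / NY] * NY
for _ in range(5000):  # power iteration for the stationary distribution
    ps = [sum(ps[y] * a[y][y2] for y in range(NY)) for y2 in range(NY)]


def joint(y, x, y2, x2):
    """Stationary probability of (y^t, x^t, y^{t+1}, x^{t+1})."""
    return ps[y] * pi[y][x] * mu[y][x][y2] * pi[y2][x2]


def optimal_nu():
    """nu[x2][y][x] = p_s(x^t | x^{t+1}, y^t)."""
    nu = [[None] * NY for _ in range(NX)]
    for x2 in range(NX):
        for y in range(NY):
            w = [sum(joint(y, x, y2, x2) for y2 in range(NY)) for x in range(NX)]
            s = sum(w)
            nu[x2][y] = [v / s for v in w]
    return nu


def check(nu):
    """Return (E[exp(-Theta_tilde - J)], transfer entropy, -E[Theta])."""
    ft = te = mean_theta = 0.0
    for y in range(NY):
        for x in range(NX):
            for y2 in range(NY):
                for x2 in range(NX):
                    p = joint(y, x, y2, x2)
                    theta = math.log(pi[y2][x2] / nu[x2][y][x])
                    theta_tilde = theta + math.log(pi[y][x] / pi[y2][x2])
                    j = math.log(mu[y][x][y2] / a[y][y2])
                    ft += p * math.exp(-theta_tilde - j)
                    te += p * j
                    mean_theta += p * theta
    return ft, te, -mean_theta


nu_rand = [[rand_dist(NX) for _ in range(NY)] for _ in range(NX)]
ft_rand, te, bound_rand = check(nu_rand)
ft_opt, _, bound_opt = check(optimal_nu())
print(ft_rand, ft_opt, te, bound_rand, bound_opt)
```

In this memoryless toy model $p_s(x^t|y^t)$ coincides with $\pi(x^t|y^t)$, which is why the code uses
`pi` in the boundary term of the second entropy production.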


Structures of Neural Networks : To maximize $-E[\Theta^t_\nu]$, the neural network must be able to
adjust $\pi$ and $\nu$ to optimal conditional distributions through learning. For this purpose, in
the remainder of this Letter, we parameterize $\pi(x^t|y^t)$ as

$$ \pi(x^t|y^t) = \prod_{\ell=1}^{L} \prod_{i=N_{\ell-1}+1}^{N_\ell}
   \pi_i\left(x^t_i \,\middle|\, y^t, \{x^t_k\}_{k=1}^{N_{\ell-1}}\right), \tag{9} $$

$$ \pi_i\left(x^t_i = 1 \,\middle|\, y^t, \{x^t_k\}_{k=1}^{N_{\ell-1}}\right) = g(e^t_i), $$
$$ e^t_i = \sum_{1 \le j \le M_\ell} \rho_{ij}\, g\left(\epsilon^t_{(\ell),j}\right) - h_0, $$
$$ \epsilon^t_{(\ell),j} = \sum_{1 \le k \le d} (\ldots)_{jk}\, y^t_k
   + \sum_{1 \le k \le N_{\ell-1}} (\ldots)_{jk}\, x^t_k - \ldots \tag{10} $$

Here, $g(e^t_i)$ is the logistic function $(1 + \tanh(e^t_i))/2$. Equation (10) describes the
situation in which each neuron computes its own transition probability, $\pi_i$, through the
intermediate units with the adjustable parameters $\rho_{ij}$, $\ldots$ and the constant parameter
$h_0$. These parameters represent the synaptic strengths and intrinsic properties of the neurons
and intermediate units. Note that we assume a layered structure of the system, as illustrated in
Fig. 2, in which neuron $x^t_i$ in layer $\ell$ receives an input $g(e^t_i)$ from the neurons in
layers 1 through $\ell - 1$ and the environment through the $\ell$-th intermediate layer. It is
known that an arbitrary continuous mapping of $\{x^t_k\}_{k=1}^{N_{\ell-1}}$ and $y^t$ to $e^t_i$ can
be approximated by the last two lines in Eq. (10) to arbitrary precision if the number of the
intermediate units, $M_\ell$, is sufficiently large [?]. Thus, any conditional probability of the
form given in Eq. (9) can be represented in terms of $g(e^t_i)$, as in Eq. (10). Increasing the
number of layers of the neural network increases its capability to represent various conditional
probabilities. We believe that the capability to represent a wide variety of conditional
probabilities will allow for realization of the optimal $\pi$, and therefore such capability is
necessary for our purposes. We model $\nu$ in the same way as $\pi$. Note that $\nu$ is not used for
the realization of neural states. We consider the situation in which only the values of
$\log \nu(x^t|x^{t+1},y^t)$ are calculated through some biological mechanisms based on the realized
states, $x^t$, $x^{t+1}$, and $y^t$ (see supplemental materials for details regarding $\nu$).
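The layered structure described above can be sketched as a sampling procedure. The following is a
minimal illustration, not the implementation used for the simulations: the dictionary keys `rho`,
`w_in`, and `h`, the layer sizes, and the random initial values are all assumptions introduced for
the example. Each layer sees the environmental input together with all neurons in lower layers,
passes them through its own intermediate units, and then samples its binary states independently.

```python
import math
import random


def g(e):
    """Logistic function written as in the text: (1 + tanh(e)) / 2."""
    return (1.0 + math.tanh(e)) / 2.0


def make_layer(n_neurons, n_hidden, n_inputs, rng):
    """Random parameters for one layer (names are illustrative assumptions)."""
    return {
        "rho": [[rng.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_neurons)],
        "w_in": [[rng.gauss(0, 1) for _ in range(n_inputs)] for _ in range(n_hidden)],
        "h": [rng.gauss(0, 1) for _ in range(n_hidden)],
    }


def sample_network(y, layers, h0, rng):
    """Sample x ~ pi(x | y) layer by layer and return (x, log pi(x | y))."""
    x, log_prob = [], 0.0
    for layer in layers:
        inputs = list(y) + x  # environment plus all lower-layer neurons
        hidden = [g(sum(w * v for w, v in zip(row, inputs)) - h)
                  for row, h in zip(layer["w_in"], layer["h"])]
        new_states = []
        for rho_i in layer["rho"]:
            p_on = g(sum(r * u for r, u in zip(rho_i, hidden)) - h0)
            s = 1 if rng.random() < p_on else 0
            log_prob += math.log(p_on if s == 1 else 1.0 - p_on)
            new_states.append(s)
        x.extend(new_states)  # neurons within one layer do not see each other
    return x, log_prob


rng = random.Random(0)
D, HIDDEN = 4, 5           # input dimension and intermediate units per layer
LAYER_SIZES = [3, 2, 2]    # neurons per layer
layers, below = [], 0
for n in LAYER_SIZES:
    layers.append(make_layer(n, HIDDEN, D + below, rng))
    below += n

y = [0.3, -1.0, 0.5, 2.0]
x, log_prob = sample_network(y, layers, h0=0.0, rng=rng)
print(x, log_prob)
```

Because the log-probability is accumulated while sampling, the same routine provides the quantity
$\log \pi(x^t|y^t)$ that enters the entropy production.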

A Simple Model of Animals Learning to Explore Environments : We have shown that neural systems can
maximize the transfer entropy and mutual information through a learning mechanism based on a
generalized fluctuation theorem. In order to characterize the present learning mechanism, we must
clarify the role of the maximization of the transfer entropy in biological contexts, while that of
the mutual information has been investigated in previous studies [?]. In the following sections, we
show that the maximization of the transfer entropy can be understood as a mechanism for the active
exploration by an animal of its environment. In order to clearly demonstrate this effect in
biological contexts, we introduce a learning problem in which an animal seeks to obtain rewards
(e.g., food, water, etc.) through active exploration.

FIG. 2. (Color online). An illustration of the layered neural networks for the modeling of $\pi$
($\nu$ is modeled in a similar manner).

Concretely, an animal with a neural system represented by the state $x^t$ moves around in a
two-dimensional grid. At each position in the grid, a value of a reward associated with that
position is defined (Fig. 3(a)). Specifically, in each timestep, the animal takes either one step or
zero steps, with the number and direction determined by the values of the specialized neurons, as
shown in Fig. 3(b). The state of the environment, $y^t$, is specified by the position of the animal
and the status of reward configuration in the grid. At each timestep, the animal "receives" the
reward $r^t = r(y^t)$ at its present position. As shown in Fig. 3(a), at most positions in the grid,
the reward takes a negative value fixed throughout the simulation. Such a negative reward is
interpreted as a punishment. The size of the punishment is minimal in the center of the grid and
increases in each direction moving away from the center. At eight (fixed) positions in the outer
region of the grid, there are positive rewards. The value of each is initially $R$. If such a
positive reward is visited by the animal, the reward at the position is 0 for the subsequent 100
timesteps and then reset to $R$. The animal receives inputs from the environment as twelve real
variables $\{y^t_k\}$ ($1 \le k \le 12$). The inputs $\{y^t_k\}$ consist of the coordinate values of
the animal's position in the grid ($k = 1, 2$), the presence or absence of a reward $R$ at the
animal's current position ($y^t_3 = 1$ or $0$) and the values of the rewards at all positions within
one step of the current position ($4 \le k \le 12$), as shown in Fig. 3(c). This set of values
allows the animal to predict the immediate consequence of its movement. Initially, the model
parameters that determine $\pi$ are set in such a way that the animal primarily attempts to avoid
negative rewards, mimicking the innate behavior of real animals (see supplemental materials). With
this model, it is very natural to consider maximization of the average reward, $E[r^t]$, by
adjusting the animal's behavior represented by $\pi$, because animals must do so for survival. This
maximization problem is called a reinforcement learning problem. However, in general, it is known
that algorithms that simply maximize $E[r^t]$ do not reach an optimal outcome in most realistic
situations because there is a lack of new experiences, unless some mechanisms for active exploration
are included [?]. In the present case, in order to obtain the rewards $R$ to realize a larger
$E[r^t]$, the animal must possess a mechanism that allows it to explore the outer region and
tolerate the punishment incurred there. In the following, we show that maximization of the transfer
entropy in addition to the average reward provides this mechanism.

FIG. 3. (Color). A simple model of an animal exploring in a two-dimensional grid: (a) the
configuration of reward in the grid, (b) animal's movement according to the values of the
specialized neurons, (c) surrounding reward values as input variables.
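The task described above can be sketched as a small environment class. The grid size, the exact
punishment profile, and the locations of the eight reward sites are not specified in the text, so
the values used below (`half_width=5`, punishment equal to the distance from the center, sites on
the border) are purely illustrative assumptions, as are the class and method names. The sketch only
fixes the features stated in the text: punishment growing away from the center, eight sites with
reward $R$ that stay empty for 100 timesteps after a visit, and a twelve-component input.

```python
class GridWorld:
    """Sketch of the exploration task; sizes and profiles are assumptions."""

    def __init__(self, half_width=5, big_reward=600.0, refractory=100):
        self.hw = half_width
        self.R = big_reward
        self.refractory = refractory
        hw = half_width
        self.sites = [(-hw, -hw), (-hw, 0), (-hw, hw), (0, -hw),
                      (0, hw), (hw, -hw), (hw, 0), (hw, hw)]
        self.timer = {s: 0 for s in self.sites}  # steps until the reward returns
        self.pos = (0, 0)

    def reward_at(self, pos):
        """Reward currently associated with a position."""
        if pos in self.timer:
            return self.R if self.timer[pos] == 0 else 0.0
        # Punishment grows with the distance from the center (assumed profile).
        return -float(max(abs(pos[0]), abs(pos[1])))

    def step(self, move):
        """move = (dx, dy) with entries in {-1, 0, 1}; returns the reward."""
        x = min(self.hw, max(-self.hw, self.pos[0] + move[0]))
        y = min(self.hw, max(-self.hw, self.pos[1] + move[1]))
        self.pos = (x, y)
        for s in self.timer:  # count down depleted sites
            if self.timer[s] > 0:
                self.timer[s] -= 1
        r = self.reward_at(self.pos)
        if self.pos in self.timer and self.timer[self.pos] == 0:
            # Empty for the next `refractory` timesteps, then reset to R.
            self.timer[self.pos] = self.refractory + 1
        return r

    def observe(self):
        """Twelve inputs: position, reward-present flag, 3x3 block of rewards."""
        x, y = self.pos
        here = self.pos in self.timer and self.timer[self.pos] == 0
        around = [self.reward_at((x + dx, y + dy))
                  for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
        return [float(x), float(y), 1.0 if here else 0.0] + around


env = GridWorld()
rewards_on_the_way = [env.step((1, 0)) for _ in range(5)]
print(rewards_on_the_way, env.observe())
```

In this sketch an animal that walks straight to a site pays the accumulated punishment on the way
before collecting $R$, which is the trade-off the learning problem is meant to capture.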


We consider the following learning problem:

$$ \max_{\pi} \left( I_{x \to y} + \beta E[r^t] \right), \tag{11} $$

where $\beta > 0$. First, we theoretically analyze the optimal $\pi$ for the above problem. Since
the neural control over the environment is deterministic in the above model, that is, for given
$x^t$ and $y^t$, we have $\mu(y^{t+1}|x^t,y^t) = 1$ or $0$, the optimization of $\pi(x^t|y^t)$
reduces to that of $a(y^{t+1}|y^t) = \sum_{x^t} \mu(y^{t+1}|x^t,y^t)\,\pi(x^t|y^t)$. As we know from
the basic theory of reinforcement learning, it is helpful for analysis of the maximization problem
treated here to consider the following functions of $y \in \mathcal{Y}$:

$$ \tilde{V}^{(\gamma)}_r(y) = E\left[\sum_{s=1}^{\infty} \gamma^s r^{t+s} \,\middle|\, y^t = y\right]
   - E\left[\sum_{s=1}^{\infty} \gamma^s r^{t+s}\right], $$
$$ \tilde{V}^{(\gamma)}_I(y) = E\left[-\sum_{s=1}^{\infty} \gamma^s \log a(y^{t+s}|y^{t+s-1}) \,\middle|\, y^t = y\right]
   - E\left[-\sum_{s=1}^{\infty} \gamma^s \log a(y^{t+s}|y^{t+s-1})\right], \tag{12} $$

where $\gamma$ is a parameter satisfying $0 < \gamma < 1$. The above quantities with
$\gamma \to 1$ represent the average amounts of "excess reward" and "excess information", obtained
from the initial state $y$ until the system has relaxed into the steady state. This is analogous to
the definition of the "excess heat" in steady-state thermodynamics [22, ?]. With these limits, we
can prove that the learning problem, Eq. (11), has a unique optimal distribution $a^*$ of the
following form (see supplemental materials):

$$ a^*(y^{t+1}|y^t) \propto \exp\left[\beta\left\{r^{t+1} + \tilde{V}^{(\gamma \to 1)}_r(y^{t+1})\right\}
   + \tilde{V}^{(\gamma \to 1)}_I(y^{t+1})\right]. \tag{13} $$

Inspecting Eq. (13), we understand that the animal shows the following three types of behaviors
determined by the value of $\beta$. In the case with finite $\beta$ ($> 0$), the animal moves with
high probability in a direction for which large future reward is expected, and with small (but
non-zero) probability in a direction for which small future reward is expected. It is known that
such exploratory behavior, with (infrequent) excursions in directions with low expected payoff, is
necessary for neural systems to find larger rewards [20, 21]. Contrastingly, in the case with
$\beta \to \infty$, the optimal behavior is deterministic, and exploration is stifled. In the case
with $\beta = 0$, the animal is completely insensitive to the values of reward. Hence, we see
behavior that is a compromise between the drive to explore and the drive to acquire large rewards,
represented by $I_{x \to y}$ and $E[r^t]$, respectively.
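The three regimes of $\beta$ can be illustrated on a toy problem. The sketch below is an assumption
laden illustration rather than the analysis of the text: it uses a discounted formulation with
$\gamma = 0.9$ instead of the $\gamma \to 1$ limit with excess quantities, a five-site line instead
of the grid, and hypothetical names (`soft_optimal_policy`, `REWARD`, `NEIGHBORS`). It iterates
$\phi(y) \leftarrow \gamma \log \sum_{y'} \exp[\beta r(y') + \phi(y')]$ over the reachable states
and then reads off a policy proportional to $\exp[\beta r(y') + \phi(y')]$.

```python
import math


def logsumexp(vals):
    """Numerically stable log(sum(exp(v)))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))


def soft_optimal_policy(reward, neighbors, beta, gamma=0.9, iters=2000):
    """Return a list of dicts: policy[y][y2] is proportional to exp(beta*r(y2) + phi(y2))."""
    n = len(reward)
    phi = [0.0] * n
    for _ in range(iters):
        phi = [gamma * logsumexp([beta * reward[y2] + phi[y2] for y2 in neighbors[y]])
               for y in range(n)]
    policy = []
    for y in range(n):
        scores = [beta * reward[y2] + phi[y2] for y2 in neighbors[y]]
        z = logsumexp(scores)
        policy.append({y2: math.exp(s - z) for y2, s in zip(neighbors[y], scores)})
    return policy


# Five positions on a line: mild punishments around the middle, a prize at the right end.
REWARD = [-2.0, -1.0, 0.0, -1.0, 5.0]
NEIGHBORS = [sorted({max(0, y - 1), y, min(4, y + 1)}) for y in range(5)]

p_zero = soft_optimal_policy(REWARD, NEIGHBORS, beta=0.0)        # reward-insensitive
p_zero_flat = soft_optimal_policy([0.0] * 5, NEIGHBORS, beta=0.0)
p_mid = soft_optimal_policy(REWARD, NEIGHBORS, beta=1.0)         # compromise
p_big = soft_optimal_policy(REWARD, NEIGHBORS, beta=50.0)        # nearly deterministic
print(p_mid[3], p_big[3])
```

With $\beta = 0$ the policy does not depend on the reward table at all, while for large $\beta$
almost all probability is placed on the move toward the prize; intermediate values keep every move
possible.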

Numerical Simulations : In order to confirm the theoretical results obtained above, we carried out
simulations in which we maximized $E[\beta r^t - \Theta^t_\nu]$ by applying a stochastic gradient
algorithm to the model depicted in Fig. 3 (see supplemental materials for the algorithms and
discussion of its biological counterparts). It is expected that this maximization will result in
the maximization in Eq. (11). We first examine the case with $\beta = \infty$, i.e., that in which
the animal attempts to maximize $E[r^t]$ ($R = 600$). In this case, we observe that the
environmental entropy, $H[y^t]$, decreases monotonically and that $E[r^t]$ becomes fixed at zero
(Fig. 4(b),(c)). This indicates that the animal has learned only to avoid the outer areas and
remains for all times at the origin. Hence, the learning has essentially failed. By contrast,
setting $\beta = 0.1$ and $R = 0$, we observe that $-E[\Theta^t_\nu]$ increases monotonically in
Fig. 4(a). We also observe that $H[y^t]$ increases in a similar manner to $-E[\Theta^t_\nu]$ and that
$I[x^t;y^t]$ almost realizes the maximal value, and satisfies $I[x^t;y^t] = H[y^t]$ (Fig. 4(b)).
Hence, we have confirmed that the maximization of $-E[\Theta^t_\nu]$ leads to exploration and
maximization of $I[x^t;y^t]$, as theoretically predicted above. Finally, with $\beta = 0.1$ and
$R = 600$, we find that the animal is able to increase $E[r^t]$ through the exploration
(Fig. 4(c)).

Conclusion : We have shown on the basis of theoretical and numerical analysis that, if the learning
process exhibited by neural systems is based on a principle described by a generalized fluctuation
theorem, such a system will learn an effective form of exploring behavior (by maximizing the
transfer entropy, $I_{x \to y}$) and acquire information about its environment (by maximizing the
mutual information, $I[x^t;y^t]$). Although informational quantities other than the transfer entropy
have been considered as mechanisms for the exploration [24-28], it has not been elucidated how those
quantities are maximized in neural systems. We believe that use of the transfer entropy as a
mechanism for exploration is more plausible, because the present learning mechanism can be utilized
for it. Although the demonstration is limited to the case of Markovian environmental dynamics and
neural networks without memory, this work will be generalized to more complex systems in the near
future using the foundation laid by the present work.

FIG. 4. (Color). The values of (a) $-E[\Theta^t_\nu]$ (the red line is out of range), (b) $H[y^t]$
and $I[x^t;y^t]$, and (c) $E[r^t]$, during the course of learning. Neural networks with $L = 4$,
$N_1 = 30$, $N_2 = 60$, $N_3 = 62$, $N_4 = 64$, $M_1 = M_2 = 120$, and $M_3 = M_4 = 60$ were
simulated with $\beta = \infty, 0.1$, and $R = 0, 600$.


This work was supported by JST CREST from 
MEXT. 
Supplemental Materials


Proof of The Maximization of The Mutual Information, $I[x^t;y^t]$ : In this section, we prove that
the maximization of the mutual information, $I[x^t;y^t]$, and Eq. (8) are equivalent to equality in
Eq. (3), under suitable conditions. First, we note that we can replace $\nu(x^t|x^{t+1},y^t)$ by
$\nu(x^t|x^{t+1},y^t,y^{t+1})$ in Eqs. (3), (4), (5), (6) and (7). In this case, equality in Eq. (3)
follows from equality in the Jensen inequality ($\exp F(Z) = E[\exp F(Z)]$ with probability 1):

$$ e^{-\tilde{\Theta}^t_\nu - J^t_{x \to y}} = E\left[e^{-\tilde{\Theta}^t_\nu - J^t_{x \to y}}\right] = 1. \tag{14} $$

This implies $-\tilde{\Theta}^t_\nu - J^t_{x \to y} = 0$ with probability 1. By rearranging terms, we
have

$$ \nu(x^t|x^{t+1},y^t,y^{t+1}) = p_s(x^t|x^{t+1},y^t,y^{t+1}). \tag{15} $$

Hence, by including $y^{t+1}$ as a conditioning variable of $\nu$, we can easily obtain the equality.
However, by reducing the number of conditioning variables, we can also obtain the maximization of
the mutual information, $I[x^t;y^t]$, as we noted in the main text. We prove this in the following.


First, we obtain an explicit expression of the optimal $\nu(x^t|x^{t+1},y^t)$ in Eq. (8) from the
following inequality:

$$ E\left[\log \frac{\nu(x^t|x^{t+1},y^t)}{p_s(x^t|x^{t+1},y^t)}\right]
   = \sum_{y^t, x^{t+1}, x^t} p_s(y^t, x^{t+1}, x^t) \log \frac{\nu(x^t|y^t,x^{t+1})}{p_s(x^t|y^t,x^{t+1})} \le 0. \tag{16} $$

The above inequality is derived from the inequality $F - 1 \ge \log F$ for positive real $F$, and
thus, the optimality condition in Eq. (8) is obtained from the equality,
$F - 1 = \log F \Leftrightarrow F = 1$ with probability 1:

$$ \nu(x^t|x^{t+1},y^t) = p_s(x^t|x^{t+1},y^t). \tag{17} $$

Then, in order to analyze the equality condition of Eq. (3) for $\nu(x^t|x^{t+1},y^t)$, we calculate
the difference $\Delta$ between the values of $-E[\Theta^t_\nu]$ as calculated with Eqs. (17) and
(15), writing

$$ \Delta = E\left[\log \frac{p_s(x^t|x^{t+1},y^t,y^{t+1})}{p_s(x^t|x^{t+1},y^t)}\right]
   = I[y^{t+1}; x^t | x^{t+1}, y^t]. \tag{18} $$

Because $-E[\Theta^t_\nu] = I_{x \to y}$ in the case with Eq. (15), we obtain

$$ I_{x \to y} + E[\Theta^t_\nu] = I[y^{t+1}; x^t | x^{t+1}, y^t] \tag{19} $$

in the case with Eq. (17).


Next, we show that the condition

$$ I[y^{t+1}; x^t | x^{t+1}, y^t] = 0 \tag{20} $$

is equivalent to the maximization of the mutual information,

$$ I[x^t; y^t] = H[y^t], \tag{21} $$

on the assumption that the environmental state space $\mathcal{Y}$ is not too finely partitioned in
comparison with the precision of neural control over the environment. Precisely, we assume that
there is no coarse-grained partition $\mathcal{Y}'$ of $\mathcal{Y}$ such that the neural control has
the same precision on the two partitions $\mathcal{Y}$ and $\mathcal{Y}'$ of the environmental state
space. We also assume that $p_s(y^{t+1}|y^t) \ne 1$ for all $y^t$ and $y^{t+1}$, which holds for most
sets of values of the model parameters in a general model. A coarse-grained partition
$\mathcal{Y}'$ is a set of subcollections of $\mathcal{Y}$ such that $y' \cap y'' = \emptyset$ for
any $y' \ne y'' \in \mathcal{Y}'$ and $\bigcup_{y' \in \mathcal{Y}'} y' = \mathcal{Y}$. For any such
coarse-grained partition $\mathcal{Y}'$, we require

$$ I[y^{t+1,\prime}; x^t | y^t] \ne I[y^{t+1}; x^t | y^t], \qquad
   y^{t+1} \in y^{t+1,\prime} \in \mathcal{Y}', \tag{22} $$

where we have defined the random variable $y^{t+1,\prime}$, which takes values in $\mathcal{Y}'$ with
$y^{t+1} \in y^{t+1,\prime}$. Under this assumption, we first show that the conditional mutual
information,

$$ I[y^{t+1}; x^{t+1} | y^t] = \sum_{x^{t+1}, y^{t+1}, y^t} p_s(y^t, y^{t+1}, x^{t+1})
   \log \frac{p_s(y^{t+1}|x^{t+1},y^t)}{p_s(y^{t+1}|y^t)}, \tag{23} $$

(23) 
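must be maximal. Note that the conditional mutual information takes its maximal value and hence
satisfies

$$ I[y^{t+1}; x^{t+1} | y^t] = H[y^{t+1}|y^t] \tag{24} $$

if and only if $p_s(y^{t+1}|x^{t+1},y^t) = 1$ with probability 1. Thus, to obtain the desired result,
we show that $p_s(y^{t+1}|x^{t+1},y^t) > 0$ for multiple $y^{t+1}$ with some $y^t = y_0$ and
$x^{t+1} = x_0$ contradicts Eq. (20).

First, we define the set

$$ \bar{\mathcal{Y}} = \left\{ y^{t+1} \in \mathcal{Y} : p_s(y^{t+1}|x^{t+1} = x_0, y^t = y_0) > 0 \right\} \tag{25} $$

and a coarse-grained partition of $\mathcal{Y}$ as

$$ \mathcal{Y}' = \{\{y\}\}_{y \in \mathcal{Y} \setminus \bar{\mathcal{Y}}} \cup \{\bar{\mathcal{Y}}\}. \tag{26} $$

Here, $\mathcal{Y} \setminus \bar{\mathcal{Y}}$ is the relative complement of $\bar{\mathcal{Y}}$ in
$\mathcal{Y}$, which consists of all the elements of $\mathcal{Y}$ that are not contained in
$\bar{\mathcal{Y}}$. The assumption in Eq. (22) requires

$$ I[y^{t+1}; x^t | y^t] - I[y^{t+1,\prime}; x^t | y^t] = I[y^{t+1}; x^t | y^{t+1,\prime}, y^t] > 0, \tag{27} $$

where the first equality holds because $y^{t+1}$ uniquely determines $y^{t+1,\prime}$ and thus the
additional inclusion of $y^{t+1,\prime}$ in the first term does not affect the value of the
conditional mutual information. Also, Eq. (20) implies

$$ I[y^{t+1}; x^t | y^t, x^{t+1}] = I[y^{t+1}; x^t | y^{t+1,\prime}, y^t, x^{t+1}] = 0. \tag{28} $$

Now, recall that the inclusion of additional conditioning variables (in this case,
$y^{t+1,\prime}$) always reduces the value of the conditional mutual information. The right-hand
side of the above equation can be written as

$$ E\left[\log \frac{p_s(y^{t+1}|x^t,y^{t+1,\prime},y^t,x^{t+1})}{p_s(y^{t+1}|y^{t+1,\prime},y^t,x^{t+1})}\right]
   = E\left[\log \frac{p_s(x^{t+1}|y^{t+1})\, p_s(y^{t+1}|y^{t+1,\prime},y^t,x^t)
       \sum_{\hat{y}^{t+1} \in y^{t+1,\prime}} p_s(x^{t+1}|\hat{y}^{t+1})\, p_s(\hat{y}^{t+1}|y^{t+1,\prime},y^t)}
      {\sum_{\tilde{y}^{t+1} \in y^{t+1,\prime}} p_s(x^{t+1}|\tilde{y}^{t+1})\, p_s(\tilde{y}^{t+1}|y^{t+1,\prime},y^t,x^t)\;
       p_s(x^{t+1}|y^{t+1})\, p_s(y^{t+1}|y^{t+1,\prime},y^t)}\right] = 0, \tag{29} $$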

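with the dummy variables $\hat{y}^{t+1}$ and $\tilde{y}^{t+1}$ having the same (conditional)
distributions as $y^{t+1}$. The above equality requires that the argument of the logarithm be 1 with
probability 1, since $F - 1 \ge \log F$, $F - 1 = \log F \Leftrightarrow F = 1$ and

$$ -E\left[\log \frac{p_s(y^{t+1}|x^t,y^{t+1,\prime},y^t,x^{t+1})}{p_s(y^{t+1}|y^{t+1,\prime},y^t,x^{t+1})}\right]
   \le \sum_{x^t, y^{t+1,\prime}, y^t, x^{t+1}} p_s(x^t, y^{t+1,\prime}, y^t, x^{t+1})
       \sum_{y^{t+1}} p_s(y^{t+1}|y^{t+1,\prime},y^t,x^{t+1})
     - \sum_{x^t, y^{t+1,\prime}, y^t, x^{t+1}} p_s(x^t, y^{t+1,\prime}, y^t, x^{t+1})
       \sum_{y^{t+1}} p_s(y^{t+1}|x^t,y^{t+1,\prime},y^t,x^{t+1}) = 0. \tag{30} $$

Since $p_s(x^{t+1} = x_0|y^{t+1}) > 0$ for $y^{t+1} \in \bar{\mathcal{Y}}$, we have

$$ \frac{p_s(y^{t+1}|y^{t+1,\prime} = \bar{\mathcal{Y}}, y^t, x^t)
         \sum_{\hat{y}^{t+1} \in \bar{\mathcal{Y}}} p_s(x^{t+1} = x_0|\hat{y}^{t+1})\,
         p_s(\hat{y}^{t+1}|y^{t+1,\prime} = \bar{\mathcal{Y}}, y^t)}
        {p_s(y^{t+1}|y^{t+1,\prime} = \bar{\mathcal{Y}}, y^t)
         \sum_{\tilde{y}^{t+1} \in \bar{\mathcal{Y}}} p_s(x^{t+1} = x_0|\tilde{y}^{t+1})\,
         p_s(\tilde{y}^{t+1}|y^{t+1,\prime} = \bar{\mathcal{Y}}, y^t, x^t)} = 1,
   \qquad \forall y^{t+1} \in \bar{\mathcal{Y}}. \tag{31} $$

Furthermore, Eq. (31) with Eq. (27) implies

$$ \frac{p_s(y^{t+1}|y^{t+1,\prime} = \bar{\mathcal{Y}}, y^t, x^t)}{p_s(y^{t+1}|y^{t+1,\prime} = \bar{\mathcal{Y}}, y^t)} = c \ne 1, \tag{32} $$

for all $y^{t+1} \in \bar{\mathcal{Y}}$ and some $y^t$ and $x^t$, noting

$$ \frac{p_s(y^{t+1}|y^{t+1,\prime}, y^t, x^t)}{p_s(y^{t+1}|y^{t+1,\prime}, y^t)} = 1,
   \qquad \forall y^{t+1,\prime} \in \mathcal{Y}' \setminus \{\bar{\mathcal{Y}}\}. \tag{33} $$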


















FIG. 5. Representation of the dynamics with an additional neural network as a causal network. 


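Here, note that $c = 1$ in Eq. (32) with Eq. (33) implies
$I[y^{t+1}; x^t | y^{t+1,\prime}, y^t] = 0$, violating the assumption in Eq. (27). However, this
implies

$$ 1 = \sum_{y^{t+1} \in \bar{\mathcal{Y}}} p_s(y^{t+1}|y^{t+1,\prime} = \bar{\mathcal{Y}}, y^t, x^t)
     = c \sum_{y^{t+1} \in \bar{\mathcal{Y}}} p_s(y^{t+1}|y^{t+1,\prime} = \bar{\mathcal{Y}}, y^t)
     = c \ne 1. \tag{34} $$

This contradiction completes the proof of the maximality of the conditional mutual information,
Eq. (24).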


Next, we show the equivalence of the maximality of the conditional mutual information and the
maximality of the mutual information. As we have discussed, Eq. (24) implies

$$ p_s(y^{t+1}|x^{t+1},y^t) = \frac{p_s(y^t|y^{t+1})\, p_s(y^{t+1}|x^{t+1})}
   {\sum_{\hat{y}^{t+1} \in \mathcal{Y}} p_s(y^t|\hat{y}^{t+1})\, p_s(\hat{y}^{t+1}|x^{t+1})} = 1, \tag{35} $$

with probability 1. Then, the assumption $p_s(y^{t+1}|y^t) \ne 1$ implies that $p_s(y^t|y^{t+1})$ is
positive for multiple $y^{t+1}$. Thus, the condition Eq. (35) implies $p_s(y^{t+1}|x^{t+1}) = 1$ with
probability 1, or equivalently, that the mutual information, $I[x^t;y^t]$, is maximal and hence
satisfies

$$ I[x^t;y^t] = H[y^t]. \tag{36} $$

Conversely, the equality in Eq. (36) implies

$$ I[y^{t+1}; x^t | x^{t+1}, y^t] = H[y^{t+1}|x^{t+1},y^t] - H[y^{t+1}|x^t,x^{t+1},y^t]
   \le H[y^{t+1}|x^{t+1}] = H[y^t] - I[x^t;y^t] = 0. \tag{37} $$

Thus we have recovered the condition Eq. (20). This completes the proof that equality in Eq. (3) is
equivalent to Eq. (8) and Eq. (36).


In the above proof, we note that different definitions of $\nu$ also lead to the maximization of the
mutual information $I[x^t;y^t]$ in different manners, although we do not note this point in the main
text for the simplicity of presentation. Concretely, we consider the model described by the causal
network in Fig. 5 by splitting $x^t$ into $x^t_{(1)}$ and $x^t_{(2)}$, and define a generalized
entropy production as


Ql = log 


7r(a;‘+/|y‘+i) 






yt+iy 


(38) 


Then, in the same manner as above, we can prove that the equality, —E[0(,] = implies the maximization of 

the mutual information, = H[?/*’'], in some coarse-grained partition y' that satisfies 


I[4);y‘’'|y‘+i]=I[4);y‘|y‘+^]. 

Further results in this direction will be investigated in the future reports. 


( 39 ) 
Modeling of $\nu$ with a Neural Network : We compute $\nu$ in the same way as $\pi$, explicitly
writing

$$ \nu(x^t|x^{t+1},y^t) = \prod_{\ell=1}^{L} \prod_{i=N_{\ell-1}+1}^{N_\ell}
   \nu_i\left(x^t_i \,\middle|\, \{x^t_k\}_{k=1}^{N_{\ell-1}}, x^{t+1}, y^t\right), $$
$$ \nu_i\left(x^t_i = 1 \,\middle|\, \{x^t_k\}_{k=1}^{N_{\ell-1}}, x^{t+1}, y^t\right) = g(f^{t+1}_i), $$
$$ f^{t+1}_i = \sum_{1 \le j \le M_\ell} \kappa_{ij}\, g\left(\eta^{t+1}_{(\ell),j}\right) - m_0, $$
$$ \eta^{t+1}_{(\ell),j} = \sum_{1 \le k \le N} (\ldots)_{jk}\, x^{t+1}_k
   + \sum_{1 \le k \le N_{\ell-1}} (\ldots)_{jk}\, x^t_k
   + \sum_{1 \le k \le d} (\ldots)_{jk}\, y^t_k - m_j. \tag{40} $$

Here, the neurons in the $\ell$-th layer receive inputs from $x^{t+1}$,
$\{x^t_k\}_{1 \le k \le N_{\ell-1}}$, and $y^t$ through the intermediate units,
$\eta^{t+1}_{(\ell),j}$, with the adjustable parameters $\kappa_{ij}$, $z_{jk}$, $u_{jk}$, and $m_j$
and the constant parameter $m_0$. This computation may seem strange, because here the neurons
receive inputs from the future states. However, this is not problematic, because the goal of the
computation is not to realize the states of the neural network but to calculate the value of
$\log \nu(x^t|x^{t+1},y^t)$. Consider the following situation for this computation, for example. The
intermediate units in the $\ell$-th layer receive inputs at time $t + 1$ from $x^{t+1}$ and also from
$\{x^t_k\}_{1 \le k \le N_{\ell-1}}$ and $y^t$ through some time-delay mechanisms. These intermediate
units send outputs $\{g(\eta^{t+1}_{(\ell),j})\}_{1 \le j \le M_\ell}$ to the neurons in the
$\ell$-th layer. At this time, the $i$-th neuron in this layer possesses memory of its own state at
time $t$, $x^t_i$, through some mechanism. Then, the $i$-th neuron can compute the value of
$\nu_i(x^t_i|\{x^t_k\}_{k=1}^{N_{\ell-1}}, x^{t+1}, y^t)$ as a function of $x^t_i$ and $f^{t+1}_i$.
The value of $\log \nu$ is the sum of the values of $\log \nu_i$ over the neurons in the neural
network.


Proofs of the Relations Used in the Theoretical Analysis of the Reinforcement Learning Problem : In
this section, our goal is to prove Eq. (13). First, we define the following functions called "value
functions" in the field of reinforcement learning:

$$ V^{(\gamma)}_{r,a}(y) = E\left[\sum_{s=1}^{\infty} \gamma^s r^{t+s} \,\middle|\, y^t = y\right], \qquad
   V^{(\gamma)}_{I,a}(y) = E\left[-\sum_{s=1}^{\infty} \gamma^s \log a(y^{t+s}|y^{t+s-1}) \,\middle|\, y^t = y\right], $$
$$ V^{(\gamma)}_a(y) = V^{(\gamma)}_{I,a}(y) + \beta V^{(\gamma)}_{r,a}(y). \tag{41} $$

Then, we can write the learning problem Eq. (11) in terms of the value functions as

$$ I_{x \to y} + \beta E[r^t] = \lim_{\gamma \to 1} (1 - \gamma) V^{(\gamma)}_a(y), \qquad \forall y \in \mathcal{Y}. \tag{42} $$

By definition, the value function satisfies the following recursive relation called the "Bellman
equation":

$$ V^{(\gamma)}_a(y^t) = \sum_{y^{t+1} \in \mathcal{Y}} a(y^{t+1}|y^t)
   \times \gamma \left\{ \beta r(y^{t+1}) - \log a(y^{t+1}|y^t) + V^{(\gamma)}_a(y^{t+1}) \right\}. \tag{43} $$

Next, we show that, for fixed $\gamma$, an optimal control $a^*(y^{t+1}|y^t)$ maximizes the value
function at all $y \in \mathcal{Y}$, in comparison with suboptimal controls. Explicitly, for any
control $a$, the following inequality holds:

$$ V^{(\gamma)}_{a^*}(y) \ge V^{(\gamma)}_a(y), \qquad \forall y \in \mathcal{Y}. \tag{44} $$

In order to prove Eq. (44), we consider the following operator called a backup operator, operating
on functions of the environmental state $y^t$:

$$ B\phi(y^t) = \max_a \sum_{y^{t+1} \in \mathcal{Y}} a(y^{t+1}|y^t)
   \left\{ \gamma \beta r(y^{t+1}) - \gamma \log a(y^{t+1}|y^t) + \gamma \phi(y^{t+1}) \right\}. \tag{45} $$

We first show that this operation results in contraction in the space of functions of environmental
states $y^t$ with respect to max norm:

$$ \| \phi \|_\infty = \max_{y^t} |\phi(y^t)|. \tag{46} $$

For two functions $\phi_1$ and $\phi_2$, a fixed $a$ and an operator defined as

$$ B_a \phi(y^t) = \sum_{y^{t+1} \in \mathcal{Y}} a(y^{t+1}|y^t)
   \left\{ \gamma \beta r(y^{t+1}) - \gamma \log a(y^{t+1}|y^t) + \gamma \phi(y^{t+1}) \right\}, \tag{47} $$

we have

$$ \| B_a \phi_1 - B_a \phi_2 \|_\infty
   = \left\| \sum_{y^{t+1}} a(y^{t+1}|y^t)\, \gamma \left\{ \phi_1(y^{t+1}) - \phi_2(y^{t+1}) \right\} \right\|_\infty
   \le \left\{ \sum_{y^{t+1}} a(y^{t+1}|y^t) \right\} \| \gamma (\phi_1 - \phi_2) \|_\infty
   = \gamma \| \phi_1 - \phi_2 \|_\infty. \tag{48} $$

Then, with the distribution $a^{(i),*}(y^{t+1}|y^t)$ maximizing $B_a \phi_i(y^t)$ ($i = 1, 2$), we
have

$$ \| B\phi_1 - B\phi_2 \|_\infty = \| B_{a^{(1),*}} \phi_1 - B_{a^{(2),*}} \phi_2 \|_\infty
   \le \max_i \| B_{a^{(i),*}} \phi_1 - B_{a^{(i),*}} \phi_2 \|_\infty
   \le \gamma \| \phi_1 - \phi_2 \|_\infty. \tag{49} $$

This proves that the backup operation yields a contraction of the space of functions on the
environmental state space $\mathcal{Y}$, and that there is a unique fixed point of this operation in
this space of functions. Because the backup operation always increases the values of any value
function at any point in $\mathcal{Y}$, we have Eq. (44). Hence, when we consider the maximality
condition of $V^{(\gamma)}_a(y)$ with respect to $a(y^{t+1}|y^t)$, it is sufficient to consider the
stationarity condition of $B V^{(\gamma)}_a(y^t)$ by differentiating it with respect to
$a(y^{t+1}|y^t)$ and simply putting the derivative of $V^{(\gamma)}_a(y^{t+1})$ to be zero. Solving
the stationarity condition with the Lagrange multiplier corresponding to
$\sum_{y^{t+1}} a(y^{t+1}|y^t) = 1$, we obtain

$$ a^*(y^{t+1}|y^t) \propto \exp\left[ \beta \left\{ r(y^{t+1}) + V^{(\gamma)}_{r,a^*}(y^{t+1}) \right\}
   + V^{(\gamma)}_{I,a^*}(y^{t+1}) \right]. \tag{50} $$

The optimal condition for the learning problem, the maximization of Eq. (42), is obtained by taking
the limit $\gamma \to 1$ in Eq. (50). In order to avoid divergence, we need to replace the value
functions in Eq. (50) with the functions defined in Eq. (12) that represent the "excess reward" and
"excess information".
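The contraction property of the backup operation can be checked numerically. The sketch below
relies on two assumptions that go beyond the text: the maximization over the distribution $a$ is
evaluated through the standard closed form
$\max_a \sum_k a_k (c_k - \log a_k) = \log \sum_k e^{c_k}$, and the environment is a hypothetical
ring of six states in which each state can reach itself and its two neighbors. The names `backup`,
`sup_norm`, `BETA`, and `GAMMA` are illustrative.

```python
import math
import random


def logsumexp(vals):
    """Numerically stable log(sum(exp(v)))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))


def backup(phi, reward, beta, gamma):
    """Entropy-regularized backup on a ring, maximized over a in closed form."""
    n = len(phi)
    out = []
    for y in range(n):
        reachable = [(y - 1) % n, y, (y + 1) % n]
        out.append(gamma * logsumexp([beta * reward[k] + phi[k] for k in reachable]))
    return out


def sup_norm(u, v):
    """Max-norm distance between two functions on the state space."""
    return max(abs(p - q) for p, q in zip(u, v))


rng = random.Random(1)
N, BETA, GAMMA = 6, 0.1, 0.8
reward = [rng.uniform(-5, 5) for _ in range(N)]

# Contraction: ||B phi1 - B phi2|| <= gamma * ||phi1 - phi2|| for random pairs.
worst_ratio = 0.0
for _ in range(200):
    phi1 = [rng.uniform(-10, 10) for _ in range(N)]
    phi2 = [rng.uniform(-10, 10) for _ in range(N)]
    ratio = (sup_norm(backup(phi1, reward, BETA, GAMMA),
                      backup(phi2, reward, BETA, GAMMA)) / sup_norm(phi1, phi2))
    worst_ratio = max(worst_ratio, ratio)

# Repeated backups converge to the unique fixed point.
phi = [0.0] * N
for _ in range(500):
    phi = backup(phi, reward, BETA, GAMMA)
residual = sup_norm(backup(phi, reward, BETA, GAMMA), phi)
print(worst_ratio, residual)
```

The observed ratio never exceeds $\gamma$, and the residual after repeated application is at the
level of floating-point error, as expected for a contraction with a unique fixed point.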


Derivation and Biological Plausibility of the Learning Rule : In the gradient ascent method used for the sim¬ 
ulation, we update each parameter 9 £ {p^, hj^\ Kij, } as follows: 

»« = o'+ e {r(/Jr« - etw - 

*• = + (1 -1) *'f'- (51) 

Here, the constant $\tau$ is a positive real number that is large compared with the mixing time of the dynamics. We set
$\epsilon$ to such a small value that the change in the model parameters in each update does not affect the stationarity on
a time scale of $\tau$. Then, in the above learning rule, the expectation value of the change in the parameter $\theta$ in each
update is proportional to the gradient of $E[\beta r^t - \Theta^t_\nu]$ with respect to $\theta$ as we show below. Thus, we can regard the learning
rule as a stochastic gradient ascent algorithm to maximize $E[\beta r^t - \Theta^t_\nu]$.
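One update of this rule can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the simulation code: it assumes the rule has the form $\theta \leftarrow \theta + \epsilon\{\tau(\beta r - \Theta)\psi - \partial\Theta/\partial\theta\}$ with the trace $\psi \leftarrow \frac{1}{\tau}\partial_\theta\log\pi + (1 - \frac{1}{\tau})\psi$, treats a single scalar parameter, and takes the per-step quantities as inputs. The function name and argument names are hypothetical.

```python
def learning_step(theta, psi_prev, grad_log_pi, reward, big_theta, grad_big_theta,
                  beta, eps, tau):
    """One stochastic gradient ascent step for a scalar parameter (assumed form).

    grad_log_pi    : d/dtheta log pi(x^{t+1} | y^{t+1}), assumed supplied by the model
    big_theta      : Theta^t = log pi(...) - log nu(...)
    grad_big_theta : d/dtheta Theta^t
    """
    # Exponentially weighted trace of past score-function terms.
    psi = grad_log_pi / tau + (1.0 - 1.0 / tau) * psi_prev
    # Ascent direction: tau * (beta * r - Theta) * psi - dTheta/dtheta.
    theta_new = theta + eps * (tau * (beta * reward - big_theta) * psi - grad_big_theta)
    return theta_new, psi

# Toy numbers chosen only to make the arithmetic easy to follow.
theta1, psi1 = learning_step(theta=0.0, psi_prev=0.0, grad_log_pi=1.0, reward=1.0,
                             big_theta=0.5, grad_big_theta=0.25,
                             beta=2.0, eps=0.1, tau=2.0)
```

With these toy inputs the trace becomes 0.5 and the parameter moves from 0 to 0.125, since $0.1 \times (2 \times 1.5 \times 0.5 - 0.25) = 0.125$.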


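In the gradient ascent method, we must calculate the gradient of the following quantity with respect to $\theta$:

$$E[\beta r^t - \Theta^t_\nu] = \sum_{y^t,\,x^t,\,y^{t+1},\,x^{t+1}} p_s(y^t)\,\pi(x^t|y^t)\,p(y^{t+1}|y^t,x^t)\,\pi(x^{t+1}|y^{t+1})\,\{\beta r(y^{t+1}) - \Theta^t_\nu\}.$$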

In this calculation, we find that differentiation of the stationary distribution $p_s(y^t)$ is apparently intractable, while
differentiation of the other components is easily carried out. We note, however, that we do not need to differentiate
the stationary distribution explicitly, assuming that the stationary distribution is a smooth function of any model
parameter $\theta$. In this case, small changes in $p_s(y^{t-\tau})$ for $\tau \gg 1$ vanish at $t$ and $t+1$, and thus terms including the
derivatives of $p_s(y^{t-\tau})$ are negligible (see also [^). Thus, we can compute the gradient as follows:




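$$\begin{aligned}
\frac{\partial}{\partial\theta} E[\beta r(y^{t+1}) - \Theta^t_\nu]
&= \lim_{\tau\to\infty} \sum p_s(y^{t-\tau})\,\frac{\partial}{\partial\theta}\Bigl[\prod_{s=0}^{\tau+1}\pi(x^{t-s+1}|y^{t-s+1}) \prod_{s=0}^{\tau} p(y^{t-s+1}|y^{t-s},x^{t-s})\,\{\beta r(y^{t+1}) - \Theta^t_\nu\}\Bigr] \\
&= \lim_{\tau\to\infty} E\Bigl[\{\beta r(y^{t+1}) - \Theta^t_\nu\}\sum_{s=0}^{\tau+1}\frac{\frac{\partial}{\partial\theta}\pi(x^{t-s+1}|y^{t-s+1})}{\pi(x^{t-s+1}|y^{t-s+1})} + \frac{\partial}{\partial\theta}\{\beta r(y^{t+1}) - \Theta^t_\nu\}\Bigr] \\
&= \lim_{\tau\to\infty} E\Bigl[\{\beta r(y^{t+1}) - \Theta^t_\nu\}\sum_{s=0}^{\tau+1}\frac{\partial}{\partial\theta}\log\pi(x^{t-s+1}|y^{t-s+1})\Bigr] - E\Bigl[\frac{\partial\Theta^t_\nu}{\partial\theta}\Bigr].
\end{aligned} \tag{52}$$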


Note that the third equality follows from = 0. In order to decompose the expectation values into time- 

stepwise quantities, we introduce the auxiliary variable ipl, defined through 


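$$\psi^t_\theta = \frac{1}{\tau}\frac{\partial}{\partial\theta}\log\pi(x^{t+1}|y^{t+1}) + \Bigl(1 - \frac{1}{\tau}\Bigr)\psi^{t-1}_\theta \tag{53}$$

and $\psi^t_\theta = 0$ ($t < 0$). Then, we have

$$\psi^t_\theta = \frac{1}{\tau}\sum_{s=0}^{t}\Bigl(1 - \frac{1}{\tau}\Bigr)^{s}\frac{\partial}{\partial\theta}\log\pi(x^{t-s+1}|y^{t-s+1}).$$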

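If the process under consideration is stationary, $\psi^t_\theta$ approaches the long-time average of $\frac{\partial}{\partial\theta}\log\pi(x^{t+1}|y^{t+1})$ as $\tau \to \infty$
and $t/\tau \to \infty$. Similarly, assuming that the correlation of $\beta r(y^{t+1}) - \Theta^t_\nu$ with $\frac{\partial}{\partial\theta}\log\pi(x^{t-s+1}|y^{t-s+1})$ is small for
$s \gg 1$ and that $\tau \gg 1$, we have

$$E\Bigl[\{\beta r(y^{t+1}) - \Theta^t_\nu\}\sum_{s=0}^{\tau+1}\frac{\partial}{\partial\theta}\log\pi(x^{t-s+1}|y^{t-s+1})\Bigr] \approx \tau\,E\bigl[\{\beta r(y^{t+1}) - \Theta^t_\nu\}\,\psi^t_\theta\bigr]. \tag{54}$$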




Then, applying a well-known argument in stochastic approximation theory [s^ . we obtain the learning rule given in 
Eq. (51) as a stepwise approximation of the gradient in Eq. (52).
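The relation between the recursive definition of the auxiliary variable and its unrolled sum can be verified numerically. The following is a minimal sketch, assuming the recursion $\psi^t = g^t/\tau + (1 - 1/\tau)\psi^{t-1}$ with $\psi = 0$ before the first step. Here `g` is a hypothetical stand-in sequence for the derivatives $\partial\log\pi/\partial\theta$, drawn at random purely for illustration, and the function names are not from the source.

```python
import random

def trace_recursive(g, tau):
    """Run psi_t = g_t / tau + (1 - 1/tau) * psi_{t-1}, starting from psi = 0."""
    psi, out = 0.0, []
    for g_t in g:
        psi = g_t / tau + (1.0 - 1.0 / tau) * psi
        out.append(psi)
    return out

def trace_unrolled(g, tau, t):
    """Closed form: (1/tau) * sum_{s=0}^{t} (1 - 1/tau)^s * g_{t-s}."""
    return sum((1.0 - 1.0 / tau) ** s * g[t - s] for s in range(t + 1)) / tau

rng = random.Random(1)
tau = 50.0
g = [rng.gauss(0.3, 1.0) for _ in range(5000)]   # illustrative noisy sequence

psi = trace_recursive(g, tau)
closed = trace_unrolled(g, tau, len(g) - 1)

# For a constant input, the trace approaches that constant once t >> tau.
psi_const = trace_recursive([0.3] * 5000, tau)[-1]
```

The recursive and unrolled values agree up to rounding, and for a constant input the trace settles at that constant, which is the sense in which the auxiliary variable tracks a long-time average when $t \gg \tau$.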


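Finally, we derive the exact form of the learning rule with respect to several $\theta$ and present its interpretation.
Note that Eq. (51) is composed of $\log\pi(x^{t+1}|y^{t+1})$, $\log\nu(x^t|x^{t+1},y^t)$ and their derivatives with respect to $\theta$. First,
we show that these components are easily calculated in a neuron-wise manner. Note that $\Theta^t_\nu$, $\log\pi(x^{t+1}|y^{t+1})$ and
$\log\nu(x^t|x^{t+1},y^t)$ are decomposed as

$$\begin{aligned}
\Theta^t_\nu &= \log\pi(x^{t+1}|y^{t+1}) - \log\nu(x^t|x^{t+1},y^t) \\
&= \sum_{\ell=1}^{L}\sum_{i=N_{\ell-1}+1}^{N_\ell}\log\pi_i\bigl(x_i^{t+1}\big|y^{t+1},\{x_j^{t+1}\}_{j=1}^{N_{\ell-1}}\bigr) - \sum_{\ell=1}^{L}\sum_{i=N_{\ell-1}+1}^{N_\ell}\log\nu_i\bigl(x_i^{t}\big|\{x_j^{t}\}_{j=1}^{N_{\ell-1}},x^{t+1},y^{t}\bigr) \\
&= \sum_i \log\chi(e_i^{t+1}, x_i^{t+1}) - \sum_i \log\chi(f_i^{t}, x_i^{t}),
\end{aligned} \tag{55}$$

where

$$\chi(a,b) = \begin{cases} g(a), & \text{if } b = 1, \\ 1 - g(a), & \text{if } b = 0. \end{cases} \tag{56}$$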
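The neuron-wise decomposition lends itself to a direct implementation. The sketch below assumes that $g$ is the logistic sigmoid (an assumption made here for illustration) and that $\chi(a, b)$ is the probability of the binary value $b$ given the input $a$. The function names and the example inputs are hypothetical. With a logistic $g$, the ratio $\chi_a/\chi$ that appears in derivatives of the log-probability reduces to $b - g(a)$.

```python
import math

def g(a):
    """Logistic sigmoid (assumed form of the activation function)."""
    return 1.0 / (1.0 + math.exp(-a))

def chi(a, b):
    """Probability that a binary unit with input a takes the value b."""
    return g(a) if b == 1 else 1.0 - g(a)

def chi_a(a, b):
    """Derivative of chi with respect to a, using g'(a) = g(a) * (1 - g(a))."""
    dg = g(a) * (1.0 - g(a))
    return dg if b == 1 else -dg

def log_prob(inputs, states):
    """Sum of per-neuron terms log chi(e_i, x_i)."""
    return sum(math.log(chi(e, x)) for e, x in zip(inputs, states))

e = [0.2, -1.0, 1.5]   # illustrative inputs to three neurons
x = [1, 0, 1]          # their binary states
lp = log_prob(e, x)
ratio = [chi_a(ei, xi) / chi(ei, xi) for ei, xi in zip(e, x)]
```

Each entry of `ratio` depends only on the input and state of a single neuron, which is what makes the per-neuron evaluation of the derivatives possible.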
Then, the derivatives of $\Theta^t_\nu$, $\log\pi(x^{t+1}|y^{t+1})$ and $\log\nu(x^t|x^{t+1},y^t)$ (with respect to $v^{(\ell)}_{jk}$, $p_{ij}$, $z^{(\ell)}_{jk}$, $\kappa_{ij}$, for example) are
calculated as follows. First, denoting the derivative of $\chi(a, b)$ with respect to $a$ as $\chi_a(a, b)$,


fi ^^ f) 

—-^log7r(x*+^|y‘+i) = ^ —-(IJfog7r^(x‘+^|y‘+^{x‘+^}fii') 

i=Ni_i + l OVjk 


i +1 


Ni 

vfpt+l r‘+li • 

i=Nt-i + l i ’ * ' 


Xa(e- 


(57) 


log^(x*+i|y‘+i) = ^log7r,(x*+i|y‘+\K+i}fir) 

Optj 

_ Xalei ) ,.(+ 1 , 

x(er‘,x‘+>) 


^logKx‘|x‘+\y*) 

OZjk 


J2 TTm 

i=^i+i dz]k 


Ni 

= E 

2 =A^^_1 + 1 


Xaif!^\xl) 



(58) 


(59) 


_E 

dnij 


log:^(x‘|x‘+^ 


y*) 


log iz, (x ■+^ \{xl}’^l^\x*+^,y*) 


dKi 


Xaifi ,xl) 

xW^\x^ 




(60) 


It should be noted that calculations of the derivatives involve quantities only for related neurons and intermediate units. 
For example, the derivative with respect to used only information regarding and , xl'^^, Pij} of the 

$i$-th neuron to which the $j$-th intermediate unit is connected. Thus, we can regard the change in the synaptic strength
as being determined by the local interactions at the synapse on the j-th intermediate unit. Continuing with this 
line of argument, we can obtain even more realistic forms of learning rules for actual neural systems. However, we do 
not go into detail here, because the argument becomes quite complicated and is beyond the scope of the current study. 


Initial Values of Model Parameters and Values of Learning Parameters Used in the Simulation : In the numerical
simulation of our model of learning, we used initial values of the model parameters that result in behavior
in which the animal primarily attempts to avoid negative reward, mimicking the innate behavior of real animals. We set
the values of the model parameters involved in the inputs to the movement-related neurons as shown in Fig. 6. A
neuron controlling motion in one of four directions receives connections with relatively strong positive weights, $p_0$,
from a specialized intermediate unit (for example, from ^ to x%). The intermediate unit receives connections 
from the environmental variables that take the values of the rewards within one step of the animal’s position,
$y_k$ ($4 \le k \le 12$), with the weight values $v_0$, $-v_0$ and 0, as illustrated in Fig. 6. These initial values of the weight
parameters make the neurons controlling motion take a value of 1 when relative amounts of the reward in the
corresponding direction are large. We chose the other weight parameters with small random values in accordance
with the following: pij ~ [—0.05 : 0.05] (z < A^ 2 ); ~ [—0.05 : 0.05] (£ = 1,2); = 0 if A ^2 < * < -^4 = and 

(z, j) 7 ^ (N, 1), (TV — 1,2), (TV — 2,1), (TV — 3,2); =0 (£ = 3,4 except the red and blue synaptic weights in Fig. 6 ); 

w^f - [-0.05 : 0.05] {I = 1,2); w\f = 0 {£ = 3,4); ho = log20; hf = 0; zcy - [-0.05 : 0.05]; ~ [-0.05 : 0.05]; 

ufk - [-0-05 : 0.05]; ~ [-0.05 : 0.05]; mf = 0; mo = 0. 

In the updates of the model parameters according to Eq. (51), we used the following (fixed) values of learning
parameters: e = 3.0 x 10“®; r = 50. 
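The random part of this initialization can be sketched as follows. The sketch assumes that the notation "$\sim [-0.05 : 0.05]$" denotes independent uniform draws from that interval, which the text does not state explicitly. The function name, the layer sizes and the seed are hypothetical, and the hand-set weights shown in Fig. 6 are not reproduced here.

```python
import random

def init_small_weights(n_rows, n_cols, rng, half_width=0.05):
    """Draw an n_rows x n_cols weight matrix uniformly from [-half_width, half_width]."""
    return [[rng.uniform(-half_width, half_width) for _ in range(n_cols)]
            for _ in range(n_rows)]

rng = random.Random(42)              # fixed seed, for reproducibility of the sketch only
p = init_small_weights(4, 3, rng)    # hypothetical layer sizes
tau = 50                             # time constant of the trace, as quoted in the text
```

Parameters that the text sets to zero or to fixed values would simply be assigned directly instead of being drawn.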
FIG. 6. (Color online). Initial values of model parameters for synaptic weights to the movement-related neurons.